In this technical report, I present my step-by-step strategy for predicting whether a firm will experience fast growth.
First, I load the necessary libraries and the data. The initial dataset contains 287829 observations and 48 variables.
Let's start with a preliminary exploration. Variables with a very large share of NAs (>80%) can be excluded right away.
to_filter <- sapply(data, function(x) sum(is.na(x)))
to_filter[to_filter > 0]
## COGS amort curr_assets curr_liab
## 269572 8040 131 131
## extra_exp extra_inc extra_profit_loss finished_prod
## 18529 18529 17203 270344
## fixed_assets inc_bef_tax intang_assets inventories
## 131 7437 140 131
## liq_assets material_exp net_dom_sales net_exp_sales
## 131 8040 269572 269572
## personnel_exp profit_loss_year sales share_eq
## 8040 9884 7437 131
## subscribed_cap tang_assets wages D
## 131 1470 269846 287829
## founded_year exit_year ceo_count foreign
## 56457 248970 56427 56427
## female birth_year inoffice_days nace_main
## 56427 111818 56427 1038
## ind2 ind founded_date exit_date
## 1038 9769 51 231649
## labor_avg
## 146532
## variables with more than 80% of NAs: COGS, finished_prod, net_dom_sales, net_exp_sales, wages
## they can be deleted
data <- data %>%
select(-c(COGS, finished_prod, net_dom_sales, net_exp_sales, wages))
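The five columns above were identified by eye from the printout; the same >80% rule can also be applied programmatically. A sketch on a toy data frame (the real `data` works identically):

```r
# Sketch: drop any column whose share of NAs exceeds 80% (toy data frame)
df <- data.frame(a = c(NA, NA, NA, NA, NA), b = 1:5, c = c(1, NA, 3, 4, 5))
na_share <- sapply(df, function(x) mean(is.na(x)))
df <- df[, na_share <= 0.8, drop = FALSE]
names(df)  # "b" and "c" survive; "a" (100% NA) is dropped
```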
The task requires working with data from 2010 to 2015. I also decided to focus on small and medium-sized enterprises (up to 50 million euros in sales, following the European Commission's definition), because very large firms may have a different growth potential and very large sales figures (sales being the dependent variable of this analysis; more details are given in the next section).
data_filtered <- data %>%
filter(year %in% c(2010:2015))
NROW(unique(data_filtered$comp_id))
## [1] 39375
## 167606 observations now, 39375 unique companies
data_filtered <- data_filtered %>%
filter(sales >= 1000 & sales <= 50000000)
NROW(unique(data_filtered$comp_id))
## [1] 33703
## 129491 observations now, 33703 unique companies
The main dependent variable is fast firm growth. It can be defined in different ways: as a change in income, a change in current assets, or a change in total sales. I prefer total sales because it is an economic variable that makes a firm's development easy to interpret: a successful, growing firm sells more. Income does not always reflect growth: a firm can sell assets or fire workers and thereby raise income or cut expenditures, and that usually happens for reasons other than growth.
For my analysis, I take the percentage change in total sales (a percentage measure accounts for different firm sizes and different types of products). The timeframe is two years: the percentage change in sales between 2012 and 2014, which smooths out one-year fluctuations and covers a wider time range. A disadvantage of sales as a dependent variable is that many exogenous confounders can affect it (for example, an economic or political crisis lowers incomes and purchasing power, which directly reduces sales); however, let's assume that no such significant event happened between 2012 and 2014 in the United States (which I believe is largely true).
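As a minimal numeric illustration of this growth measure (hypothetical sales figures):

```r
# Hypothetical firm: sales of 120000 in 2012 and 162000 in 2014
sales_2012 <- 120000
sales_2014 <- 162000
perc_change <- (sales_2014 - sales_2012) / sales_2012 * 100
perc_change  # 35 percent growth over the two-year window
```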
data_filtered_wide <- data_filtered %>%
  pivot_wider(id_cols = comp_id, names_from = year, values_from = sales) %>%
  mutate(total_change = `2014` - `2012`,
         perc_change = total_change/`2012`*100)
summary(data_filtered_wide$perc_change)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## -99.8 -17.5 12.8 196.8 58.1 509617.1 16626
There are 16626 NAs (one per firm, since the data have been transformed to wide format), so I decided to impute values where possible: a missing 2012 or 2014 value is replaced by the average of the two neighbouring years or, failing that, by a single adjacent year.
data_filtered_wide <- data_filtered %>%
  pivot_wider(id_cols = comp_id, names_from = year, values_from = sales) %>%
  mutate(imputed = ifelse(is.na(`2012`) | is.na(`2014`), 1, 0)) %>%
  mutate(`2012` = ifelse(is.na(`2012`) & !is.na(`2011`) & !is.na(`2013`), (`2011` + `2013`)/2, `2012`),
         `2012` = ifelse(is.na(`2012`) & is.na(`2011`) & !is.na(`2013`), `2013`, `2012`),
         `2014` = ifelse(is.na(`2014`) & !is.na(`2013`) & !is.na(`2015`), (`2013` + `2015`)/2, `2014`),
         `2014` = ifelse(is.na(`2014`) & is.na(`2013`) & !is.na(`2015`), `2015`, `2014`)) %>%
  mutate(total_change = `2014` - `2012`,
         perc_change = total_change/`2012`*100)
summary(data_filtered_wide$perc_change)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## -100.0 -15.9 15.9 247.2 72.8 509617.1 14041
The number of missing values was reduced to 14041. Since further imputation does not seem possible, all remaining NAs are excluded from the analysis. The filtered dataset (back in long format) contains 102932 observations and 19662 unique firms.
data_filtered <- data_filtered_wide %>%
select(comp_id, imputed, total_change, perc_change) %>%
right_join(data_filtered, by = "comp_id")
data_filtered <- data_filtered[!is.na(data_filtered$perc_change),]
summary(data_filtered$perc_change)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -100.0 -16.4 13.5 203.6 60.8 509617.1
NROW(unique(data_filtered$comp_id))
## [1] 19662
## 102932 observations now, 19662 firms
quantile(data_filtered$perc_change, probs = c(0.65))
## 65%
## 35.02251
data_filtered %>%
filter(perc_change <= 1000) %>%
ggplot(aes(x = perc_change)) + geom_density() +
geom_vline(xintercept = 35, color = "blue")
Finally, I create a fast-growth dummy, the main dependent variable of my analysis. Initially, I wanted to define a fast-growth company by the third quartile, which corresponds to roughly 60% growth. However, I believe that a 40 or 50 percent increase in sales over two years can also be considered fast. Thus, I use the 65th percentile instead of the third quartile (= 35% growth in sales). This splits the data fairly evenly (65% to 35%), and from an empirical perspective, I believe that 35% sales growth over two years can be classified as fast. 7432 firms are classified as fast-growing and 12230 as not fast-growing.
# creating fast growth binary
data_filtered <- data_filtered %>%
mutate(fast_growth = ifelse(perc_change >= 35, 1, 0))
data_filtered$fast_growth_f <- as.factor(data_filtered$fast_growth)
summary(data_filtered$fast_growth)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.0000 0.0000 0.3502 1.0000 1.0000
table(data_filtered$fast_growth_f)
##
## 0 1
## 66882 36050
data_filtered %>%
select(comp_id, fast_growth) %>%
unique() %>%
group_by(fast_growth) %>%
dplyr::summarize(n_firms = n())
## # A tibble: 2 × 2
## fast_growth n_firms
## <dbl> <int>
## 1 0 12230
## 2 1 7432
First, I group the independent variables into several categories.
Fixed Effects and other relevant characteristics
Firm's Performance: "ln_amort", "amort_flag", "ln_curr_assets", "ln_fixed_assets", "ln_intang_assets", "ln_liq_assets", "ln_tang_assets", "flag_asset_problem", "ln_curr_liab", "ln_extra_exp", "ln_extra_inc", "inc_bef_tax_st", "ln_inventories", "ln_material_exp", "ln_personnel_exp", "profit_loss_year_st", "ln_subscribed_cap", "flag_error"
In many cases, I use natural logarithms instead of raw values, because this dampens the effect of large values and makes the distributions closer to Gaussian. Also, some financial indicators (for example, assets) cannot take values below 0, so I imputed such values and flagged the imputation. Furthermore, I standardised and winsorised two variables (income before taxes and profit/loss for the year). All NAs (usually few for each variable) were imputed.
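The per-variable chunks that follow all repeat the same impute/truncate/log pattern; as a sketch (a hypothetical helper, not part of the original pipeline), the pattern can be written once:

```r
# Sketch: median-impute NAs, truncate negatives to 0, then take log(x + 1)
impute_and_log <- function(x) {
  x[is.na(x)] <- median(x, na.rm = TRUE)
  x[x < 0] <- 0
  log(x + 1)
}
impute_and_log(c(100, NA, -5))  # the NA becomes the median (47.5), -5 becomes 0
```

The explicit per-variable `mutate()` calls below keep each variable's summary and density plot next to its transformation, so I retain them; the flag columns would still be created separately.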
## amort
data_filtered %>%
ggplot(aes(x = amort)) + geom_density()
summary(data_filtered$amort)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## -57800 259 1374 18662 5681 6597160 199
# 199 NAs - not so many, so can impute with a median; variables under 0 can be changed to 0 with flagging it
data_filtered <- data_filtered %>%
mutate(amort = ifelse(is.na(amort), median(amort, na.rm = T), amort),
amort_flag = ifelse(amort < 0, 1, 0),
amort = ifelse(amort < 0, 0, amort),
ln_amort = log(amort + 1))
data_filtered %>%
ggplot(aes(x = ln_amort)) + geom_density()
# flagging if assets < 0
data_filtered <- data_filtered %>%
mutate(flag_asset_problem = ifelse(intang_assets < 0 | curr_assets < 0 | fixed_assets < 0 | liq_assets < 0 | tang_assets < 0, 1, 0))
## curr_assets
data_filtered %>%
ggplot(aes(x = curr_assets)) + geom_density()
summary(data_filtered$curr_assets)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## -34600 4711 15530 206644 56000 141101504 19
# 19 NAs - not so many, so again can impute with a median; variables under 0 can be changed to 0 with flagging it
data_filtered <- data_filtered %>%
mutate(curr_assets = ifelse(is.na(curr_assets), median(curr_assets, na.rm = T), curr_assets),
curr_assets = ifelse(curr_assets < 0, 0, curr_assets),
ln_curr_assets = log(curr_assets + 1))
data_filtered %>%
ggplot(aes(x = ln_curr_assets)) + geom_density()
## fixed_assets
data_filtered %>%
ggplot(aes(x = fixed_assets)) + geom_density()
summary(data_filtered$fixed_assets)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## -7341 663 8374 305925 61241 2696748032 19
# 19 NAs - not so many, so again can impute with a median; variables under 0 can be changed to 0 with flagging it
data_filtered <- data_filtered %>%
mutate(fixed_assets = ifelse(is.na(fixed_assets), median(fixed_assets, na.rm = T), fixed_assets),
fixed_assets = ifelse(fixed_assets < 0, 0, fixed_assets),
ln_fixed_assets = log(fixed_assets + 1))
data_filtered %>%
ggplot(aes(x = ln_fixed_assets)) + geom_density()
## intang_assets
data_filtered %>%
ggplot(aes(x = intang_assets)) + geom_density()
summary(data_filtered$intang_assets)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## -56 0 0 6335 0 10483637 19
# 19 NAs - not so many, so again can impute with a median; variables under 0 can be changed to 0 with flagging it
data_filtered <- data_filtered %>%
mutate(intang_assets = ifelse(is.na(intang_assets), median(intang_assets, na.rm = T), intang_assets),
intang_assets = ifelse(intang_assets < 0, 0, intang_assets),
ln_intang_assets = log(intang_assets + 1))
data_filtered %>%
ggplot(aes(x = ln_intang_assets)) + geom_density()
## liq_assets
data_filtered %>%
ggplot(aes(x = liq_assets)) + geom_density()
summary(data_filtered$liq_assets)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## -83589 719 3111 49712 13900 90536320 19
# 19 NAs - not so many, so again can impute with a median; variables under 0 can be changed to 0 with flagging it
data_filtered <- data_filtered %>%
mutate(liq_assets = ifelse(is.na(liq_assets), median(liq_assets, na.rm = T), liq_assets),
liq_assets = ifelse(liq_assets < 0, 0, liq_assets),
ln_liq_assets = log(liq_assets + 1))
data_filtered %>%
ggplot(aes(x = ln_liq_assets)) + geom_density()
## tang_assets
data_filtered %>%
ggplot(aes(x = tang_assets)) + geom_density()
summary(data_filtered$tang_assets)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## -10896 574 7593 244436 56593 120959248 19
# 19 NAs - not so many, so again can impute with a median; variables under 0 can be changed to 0 with flagging it
data_filtered <- data_filtered %>%
mutate(tang_assets = ifelse(is.na(tang_assets), median(tang_assets, na.rm = T), tang_assets),
tang_assets = ifelse(tang_assets < 0, 0, tang_assets),
ln_tang_assets = log(tang_assets + 1))
data_filtered %>%
ggplot(aes(x = ln_tang_assets)) + geom_density()
# flagging if other non-negative variables < 0
data_filtered <- data_filtered %>%
mutate(flag_error = ifelse(curr_liab < 0 | extra_exp < 0 | extra_inc < 0 | inventories < 0 | material_exp < 0 | personnel_exp < 0 | subscribed_cap < 0, 1, 0))
# curr_liab
data_filtered %>%
ggplot(aes(x = curr_liab)) + geom_density()
summary(data_filtered$curr_liab)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## -22778 3874 16367 160693 57311 137662528 19
# 19 NAs - not so many, so again can impute with a median; variables under 0 can be changed to 0 with flagging it
data_filtered <- data_filtered %>%
mutate(curr_liab = ifelse(is.na(curr_liab), median(curr_liab, na.rm = T), curr_liab),
curr_liab = ifelse(curr_liab < 0, 0, curr_liab),
ln_curr_liab = log(curr_liab + 1))
data_filtered %>%
ggplot(aes(x = ln_curr_liab)) + geom_density()
## extra_exp
data_filtered %>%
ggplot(aes(x = extra_exp)) + geom_density()
summary(data_filtered$extra_exp)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## -133585 0 0 1599 0 13918115 3
# 3 NAs - not so many, so again can impute with a median; variables under 0 can be changed to 0 with flagging it
data_filtered <- data_filtered %>%
mutate(extra_exp = ifelse(is.na(extra_exp), median(extra_exp, na.rm = T), extra_exp),
extra_exp = ifelse(extra_exp < 0, 0, extra_exp),
ln_extra_exp = log(extra_exp + 1))
data_filtered %>%
ggplot(aes(x = ln_extra_exp)) + geom_density()
## extra_inc
data_filtered %>%
ggplot(aes(x = extra_inc)) + geom_density()
summary(data_filtered$extra_inc)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## -274407 0 0 3295 0 15686481 3
# 3 NAs - not so many, so again can impute with a median; variables under 0 can be changed to 0 with flagging it
data_filtered <- data_filtered %>%
mutate(extra_inc = ifelse(is.na(extra_inc), median(extra_inc, na.rm = T), extra_inc),
extra_inc = ifelse(extra_inc < 0, 0, extra_inc),
ln_extra_inc = log(extra_inc + 1))
data_filtered %>%
ggplot(aes(x = ln_extra_inc)) + geom_density()
## inc_bef_tax
data_filtered %>%
ggplot(aes(x = inc_bef_tax)) + geom_density()
summary(data_filtered$inc_bef_tax)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -17102708 -3570 889 21334 7459 13840463
# in this case, I decided to standardise and then winsorise it
data_filtered <- data_filtered %>%
mutate(inc_bef_tax_st = (inc_bef_tax - mean(inc_bef_tax))/sd(inc_bef_tax))
data_filtered$inc_bef_tax_st <- Winsorize(data_filtered$inc_bef_tax_st, minval = NULL, maxval = NULL, probs = c(0.05, 0.95),
na.rm = FALSE, type = 7)
data_filtered %>%
ggplot(aes(x = inc_bef_tax_st)) + geom_density()
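`Winsorize` here presumably comes from the `DescTools` package (an assumption, since the library calls are not shown above). A base-R sketch of the same two-sided winsorisation at the 5th/95th percentiles:

```r
# Sketch: clamp values below the 5th and above the 95th percentile (type 7 quantiles)
winsorize_05_95 <- function(x) {
  q <- quantile(x, probs = c(0.05, 0.95), na.rm = TRUE, type = 7)
  pmin(pmax(x, q[1]), q[2])
}
winsorize_05_95(c(-1000, 1:9, 1000))  # extremes are pulled in to -499.5 and 504.5
```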
## inventories
data_filtered %>%
ggplot(aes(x = inventories)) + geom_density()
summary(data_filtered$inventories)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## -2607 0 1589 59118 9252 23653078 19
# 19 NAs - not so many, so again can impute with a median; variables under 0 can be changed to 0 with flagging it
data_filtered <- data_filtered %>%
mutate(inventories = ifelse(is.na(inventories), median(inventories, na.rm = T), inventories),
inventories = ifelse(inventories < 0, 0, inventories),
ln_inventories = log(inventories + 1))
data_filtered %>%
ggplot(aes(x = ln_inventories)) + geom_density()
## material_exp
data_filtered %>%
ggplot(aes(x = material_exp)) + geom_density()
summary(data_filtered$material_exp)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## -1097800 12922 40744 345940 127715 44773684 199
# 199 NAs - not so many, so again can impute with a median; variables under 0 can be changed to 0 with flagging it
data_filtered <- data_filtered %>%
mutate(material_exp = ifelse(is.na(material_exp), median(material_exp, na.rm = T), material_exp),
material_exp = ifelse(material_exp < 0, 0, material_exp),
ln_material_exp = log(material_exp + 1))
data_filtered %>%
ggplot(aes(x = ln_material_exp)) + geom_density()
## personnel_exp
data_filtered %>%
ggplot(aes(x = personnel_exp)) + geom_density()
summary(data_filtered$personnel_exp)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## -163367 4874 13533 99262 38600 23893608 199
# 199 NAs - not so many, so again can impute with a median; variables under 0 can be changed to 0 with flagging it
data_filtered <- data_filtered %>%
mutate(personnel_exp = ifelse(is.na(personnel_exp), median(personnel_exp, na.rm = T), personnel_exp),
personnel_exp = ifelse(personnel_exp < 0, 0, personnel_exp),
ln_personnel_exp = log(personnel_exp + 1))
data_filtered %>%
ggplot(aes(x = ln_personnel_exp)) + geom_density()
## profit_loss_year
data_filtered %>%
ggplot(aes(x = profit_loss_year)) + geom_density()
summary(data_filtered$profit_loss_year)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## -17102708 -3663 444 8981 4504 13827796 25
# 25 NAs - not so many, so again can impute with a median; then standardisation and winsorisation
data_filtered <- data_filtered %>%
mutate(profit_loss_year = ifelse(is.na(profit_loss_year), median(profit_loss_year, na.rm = T), profit_loss_year),
profit_loss_year_st = (profit_loss_year - mean(profit_loss_year))/sd(profit_loss_year))
data_filtered$profit_loss_year_st <- Winsorize(data_filtered$profit_loss_year_st, minval = NULL, maxval = NULL, probs = c(0.05, 0.95),
na.rm = FALSE, type = 7)
data_filtered %>%
ggplot(aes(x = profit_loss_year_st)) + geom_density()
## subscribed_cap
data_filtered %>%
ggplot(aes(x = subscribed_cap)) + geom_density()
summary(data_filtered$subscribed_cap)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## -185 1852 6467 66764 11111 46222532 19
# 19 NAs - not so many, so again can impute with a median; variables under 0 can be changed to 0 with flagging it
data_filtered <- data_filtered %>%
mutate(subscribed_cap = ifelse(is.na(subscribed_cap), median(subscribed_cap, na.rm = T), subscribed_cap),
subscribed_cap = ifelse(subscribed_cap < 0, 0, subscribed_cap),
ln_subscribed_cap = log(subscribed_cap + 1))
data_filtered %>%
ggplot(aes(x = ln_subscribed_cap)) + geom_density()
## set of numeric variables
financial_perform <- c("ln_amort", "amort_flag", "ln_curr_assets", "ln_fixed_assets", "ln_intang_assets", "ln_liq_assets", "ln_tang_assets", "flag_asset_problem", "ln_curr_liab", "ln_extra_exp", "ln_extra_inc", "inc_bef_tax_st", "ln_inventories", "ln_material_exp", "ln_personnel_exp", "profit_loss_year_st", "ln_subscribed_cap", "flag_error")
## set of variables related to the people
### ceo_count
data_filtered %>%
ggplot(aes(x = ceo_count)) + geom_bar()
summary(data_filtered$ceo_count)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 1.000 1.000 1.000 1.294 2.000 15.000 7944
# let's merge all categories with 3 or more CEOs; for NAs, assume the median (1 CEO) and flag them
data_filtered <- data_filtered %>%
mutate(ceo_count_flag = ifelse(is.na(ceo_count), 1, 0),
ceo_count = ifelse(is.na(ceo_count), median(ceo_count, na.rm = T), ceo_count),
ceo_count = as.factor(ifelse(ceo_count >= 3, 3, ceo_count)))
data_filtered %>%
ggplot(aes(x = ceo_count)) + geom_bar()
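The same collapsing can be done in one step with `pmin` (a base-R sketch on a toy vector; the result matches the `ifelse` above):

```r
# Sketch: cap CEO counts at 3 and turn the result into a factor
cc <- c(1, 1, 2, 5, 15)
cc_cat <- factor(pmin(cc, 3), labels = c("1", "2", "3"))
table(cc_cat)
```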
### foreign CEO share
data_filtered %>%
ggplot(aes(x = foreign)) + geom_density()
summary(data_filtered$foreign)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.000 0.000 0.000 0.103 0.000 1.000 7944
# let's replace NAs with the mean (since it is a share) and flag them
data_filtered <- data_filtered %>%
mutate(foreign_flag = ifelse(is.na(foreign), 1, 0),
foreign = ifelse(is.na(foreign), mean(foreign, na.rm = T), foreign))
data_filtered %>%
ggplot(aes(x = foreign)) + geom_bar()
### female CEO share
data_filtered %>%
ggplot(aes(x = female)) + geom_density()
summary(data_filtered$female)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.000 0.000 0.000 0.251 0.500 1.000 7944
# in the same way, let's replace NAs with the mean (since it is a share) and flag them
data_filtered <- data_filtered %>%
mutate(female_flag = ifelse(is.na(female), 1, 0),
female = ifelse(is.na(female), mean(female, na.rm = T), female))
data_filtered %>%
ggplot(aes(x = female)) + geom_bar()
### birth_year
# in the same way, let's replace NAs with the mean (rounded down) and flag them
data_filtered <- data_filtered %>%
mutate(birth_year_flag = ifelse(is.na(birth_year), 1, 0),
birth_year = ifelse(is.na(birth_year), mean(birth_year, na.rm = T), birth_year),
birth_year = floor(birth_year),
birth_year = as.factor(birth_year))
data_filtered %>%
ggplot(aes(x = birth_year)) + geom_bar()
### inoffice_days
data_filtered %>%
ggplot(aes(x = inoffice_days)) + geom_bar()
# table(data_filtered$inoffice_days)
summary(data_filtered$inoffice_days)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 10 1858 2718 3108 3706 10320 7944
# in the same way, let's replace NAs with the mean and flag them
data_filtered <- data_filtered %>%
mutate(inoffice_days_flag = ifelse(is.na(inoffice_days), 1, 0),
inoffice_days = ifelse(is.na(inoffice_days), mean(inoffice_days, na.rm = T), inoffice_days))
data_filtered %>%
ggplot(aes(x = inoffice_days)) + geom_bar()
### gender
data_filtered %>%
ggplot(aes(x = gender)) + geom_bar()
table(data_filtered$gender)
##
## female male mix
## 7944 18163 65341 11484
summary(data_filtered$gender)
## Length Class Mode
## 102932 character character
# here, I think the empty (missing) category can serve as the reference
data_filtered$gender <- fct_recode(data_filtered$gender, "zero" = "")
data_filtered <- data_filtered %>%
mutate(gender = as.factor(gender),
gender = relevel(gender, ref = "zero"))
### origin
data_filtered %>%
ggplot(aes(x = origin)) + geom_bar()
table(data_filtered$origin)
##
## Domestic Foreign mix
## 7944 83730 8414 2844
summary(data_filtered$origin)
## Length Class Mode
## 102932 character character
# the empty category can again serve as the reference
data_filtered$origin <- fct_recode(data_filtered$origin, "zero" = "")
data_filtered <- data_filtered %>%
mutate(origin = as.factor(origin),
origin = relevel(origin, ref = "zero"))
hr <- c("ceo_count", "ceo_count_flag", "foreign", "female",
"inoffice_days", "gender", "origin")
### urban_m
data_filtered %>%
ggplot(aes(x = urban_m)) + geom_bar()
data_filtered <- data_filtered %>%
mutate(urban_m = as.factor(urban_m),
urban_m = fct_recode(urban_m, "capital" = "1", "other big" = "2", "other" = "3"))
data_filtered %>%
ggplot(aes(x = urban_m)) + geom_bar()
### region_m
data_filtered %>%
ggplot(aes(x = region_m)) + geom_bar()
table(data_filtered$region_m) # 321 observations with a missing region; this time I drop them
##
## Central East West
## 321 61143 25426 16042
data_filtered <- data_filtered %>%
mutate(region_m = factor(region_m, levels = c("Central", "East", "West"))) %>%
drop_na(region_m) # delete NAs
data_filtered %>%
ggplot(aes(x = region_m)) + geom_bar()
geography <- c("urban_m", "region_m")
data_filtered %>%
ggplot(aes(x = year)) + geom_bar()
data_filtered$year <- as.factor(data_filtered$year)
summary(data_filtered$year)
## 2010 2011 2012 2013 2014 2015
## 13753 15438 17309 19247 19241 17623
data_filtered %>%
ggplot(aes(x = ind2)) + geom_bar()
## Warning: Removed 34 rows containing non-finite values (stat_count).
## too many categories, some too specific and small - they need to be merged into broader groups; the 34 NAs form a separate category (99)
data_filtered <- data_filtered %>%
mutate(ind2_cat = ind2 %>%
ifelse(. > 56, 60, .) %>%
ifelse(. < 26, 20, .) %>%
ifelse(. < 55 & . > 35, 40, .) %>%
ifelse(. == 31, 30, .) %>%
ifelse(is.na(.), 99, .)
)
data_filtered$ind2_cat <- as.factor(data_filtered$ind2_cat)
table(data_filtered$ind2_cat)
##
## 20 26 27 28 29 30 32 33 40 55 56 60 99
## 258 5860 3580 10592 1687 909 746 10076 1046 11482 55199 1142 34
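The chained `ifelse` calls above depend on their order of application; the same recoding can be written more transparently with `dplyr::case_when` (a sketch; conditions are checked top to bottom):

```r
# Sketch: the same industry recoding as the ifelse chain, with explicit precedence
recode_ind2 <- function(ind2) {
  dplyr::case_when(
    is.na(ind2)           ~ 99,
    ind2 > 56             ~ 60,
    ind2 < 26             ~ 20,
    ind2 > 35 & ind2 < 55 ~ 40,
    ind2 == 31            ~ 30,
    TRUE                  ~ ind2
  )
}
```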
data_filtered %>%
ggplot(aes(x = imputed)) + geom_bar()
data_filtered$imputed <- as.factor(data_filtered$imputed)
summary(data_filtered$imputed)
## 0 1
## 94579 8032
## set of basic variables
fe <- c("year", "ind2_cat", "imputed")
Finally, I add some polynomials and interactions.
# sq is needed
ggplot(data = data_filtered, aes(x = ln_amort, y = as.numeric(fast_growth))) +
geom_point(size = 2, shape = 20, stroke=2, fill = "blue", color = "blue") +
geom_smooth(method = "lm", formula = y ~ poly(x,2), color = color[4], se = F, size = 1)+
geom_smooth(method = "loess", se = F, colour = color[5], size = 1.5, span = 0.9) +
labs(x = "ln_amort", y = "fast_growth") +
theme_bg()
## `geom_smooth()` using formula 'y ~ x'
# sq is needed
ggplot(data = data_filtered, aes(x = ln_curr_assets, y = as.numeric(fast_growth))) +
geom_point(size = 2, shape = 20, stroke=2, fill = "blue", color = "blue") +
geom_smooth(method = "lm", formula = y ~ x, color = color[4], se = F, size = 1)+
geom_smooth(method = "loess", se = F, colour = color[5], size = 1.5, span = 0.9) +
labs(x = "ln_curr_assets", y = "fast_growth") +
theme_bg()
## `geom_smooth()` using formula 'y ~ x'
# it is better to include all squared terms; LASSO will decide which to keep
data_filtered <- data_filtered %>%
mutate(ln_amort2 = ln_amort^2,
ln_curr_assets2 = ln_curr_assets^2,
ln_curr_liab2 = ln_curr_liab^2,
ln_extra_exp2 = ln_extra_exp^2,
ln_extra_inc2 = ln_extra_inc^2,
ln_fixed_assets2 = ln_fixed_assets^2,
inc_bef_tax_st2 = inc_bef_tax_st^2,
ln_intang_assets2 = ln_intang_assets^2,
ln_inventories2 = ln_inventories^2,
ln_liq_assets2 = ln_liq_assets^2,
ln_material_exp2 = ln_material_exp^2,
ln_personnel_exp2 = ln_personnel_exp^2,
profit_loss_year_st2 = profit_loss_year_st^2,
ln_tang_assets2 = ln_tang_assets^2,
ln_subscribed_cap2 = ln_subscribed_cap^2)
polynomials <- c("ln_amort2", "ln_curr_assets2", "ln_curr_liab2", "ln_extra_exp2",
"ln_extra_inc2", "ln_fixed_assets2",
"inc_bef_tax_st2", "ln_intang_assets2", "ln_inventories2",
"ln_liq_assets2", "ln_material_exp2", "ln_personnel_exp2",
"profit_loss_year_st2", "ln_tang_assets2", "ln_subscribed_cap2")
interactions1 <- c("urban_m*ln_extra_exp", "urban_m*ln_personnel_exp", "urban_m*inc_bef_tax_st", "urban_m*ln_fixed_assets2",
"origin*ln_extra_exp", "origin*ln_personnel_exp", "origin*inc_bef_tax_st", "origin*ln_fixed_assets2",
"region_m*ln_extra_exp", "region_m*ln_personnel_exp", "region_m*inc_bef_tax_st", "region_m*ln_fixed_assets2",
"gender*ln_extra_exp", "gender*ln_personnel_exp", "gender*inc_bef_tax_st", "gender*ln_fixed_assets2")
interactions2 <- c("foreign*ln_extra_exp", "foreign*ln_personnel_exp", "foreign*inc_bef_tax_st", "foreign*ln_fixed_assets2",
"female*ln_extra_exp", "female*ln_personnel_exp", "female*inc_bef_tax_st", "female*ln_fixed_assets2")
I build six basic models of increasing complexity.
X1 <- c(financial_perform)
X2 <- c(financial_perform, hr)
X3 <- c(financial_perform, hr, geography)
X4 <- c(financial_perform, hr, geography, fe)
X5 <- c(financial_perform, hr, geography, fe, polynomials)
X6 <- c(financial_perform, hr, geography, fe, polynomials, interactions1, interactions2)
# for LASSO
logitvars <- c(financial_perform, hr, geography, fe, polynomials, interactions1, interactions2)
# for random forest
rfvars <- c(financial_perform, hr, geography, fe)
Now, I compare the results of OLS (the linear probability model) and logit.
# ols - model 1
ols_modelx1 <- lm(formula(paste0("fast_growth ~ ", paste0(X1, collapse = " + "))),
data = data_filtered)
summary(ols_modelx1)
##
## Call:
## lm(formula = formula(paste0("fast_growth ~ ", paste0(X1, collapse = " + "))),
## data = data_filtered)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.5630 -0.3551 -0.3196 0.6251 0.8808
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.3165551 0.0122624 25.815 < 2e-16 ***
## ln_amort 0.0005861 0.0007831 0.748 0.454189
## amort_flag -0.1233684 0.0907240 -1.360 0.173889
## ln_curr_assets -0.0055367 0.0014194 -3.901 9.59e-05 ***
## ln_fixed_assets -0.0027046 0.0019025 -1.422 0.155136
## ln_intang_assets -0.0012043 0.0006190 -1.946 0.051699 .
## ln_liq_assets 0.0033591 0.0009885 3.398 0.000679 ***
## ln_tang_assets -0.0053018 0.0018681 -2.838 0.004540 **
## flag_asset_problem -0.0111655 0.0483724 -0.231 0.817453
## ln_curr_liab 0.0086665 0.0008319 10.417 < 2e-16 ***
## ln_extra_exp -0.0051823 0.0008287 -6.254 4.02e-10 ***
## ln_extra_inc -0.0026911 0.0006453 -4.170 3.04e-05 ***
## inc_bef_tax_st -0.0753540 0.0303073 -2.486 0.012908 *
## ln_inventories -0.0030928 0.0004365 -7.086 1.39e-12 ***
## ln_material_exp 0.0133353 0.0013203 10.100 < 2e-16 ***
## ln_personnel_exp -0.0065671 0.0006457 -10.170 < 2e-16 ***
## profit_loss_year_st 0.2070615 0.0359633 5.758 8.56e-09 ***
## ln_subscribed_cap -0.0019209 0.0008703 -2.207 0.027308 *
## flag_error 0.0331386 0.0462199 0.717 0.473390
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4753 on 102380 degrees of freedom
## (212 observations deleted due to missingness)
## Multiple R-squared: 0.008024, Adjusted R-squared: 0.00785
## F-statistic: 46.01 on 18 and 102380 DF, p-value: < 2.2e-16
# logit - model 1
glm_modelx1 <- glm(formula(paste0("fast_growth_f ~", paste0(X1, collapse = " + "))),
data = data_filtered, family = "binomial")
summary(glm_modelx1)
##
## Call:
## glm(formula = formula(paste0("fast_growth_f ~", paste0(X1, collapse = " + "))),
## family = "binomial", data = data_filtered)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.3071 -0.9355 -0.8771 1.4018 1.9069
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.768939 0.054498 -14.109 < 2e-16 ***
## ln_amort 0.002630 0.003442 0.764 0.444839
## amort_flag -0.671975 0.481310 -1.396 0.162673
## ln_curr_assets -0.025108 0.006269 -4.005 6.20e-05 ***
## ln_fixed_assets -0.011226 0.008296 -1.353 0.175975
## ln_intang_assets -0.005666 0.002775 -2.042 0.041151 *
## ln_liq_assets 0.014954 0.004395 3.402 0.000668 ***
## ln_tang_assets -0.023460 0.008151 -2.878 0.003999 **
## flag_asset_problem -0.050676 0.213767 -0.237 0.812610
## ln_curr_liab 0.038410 0.003740 10.271 < 2e-16 ***
## ln_extra_exp -0.024045 0.003790 -6.345 2.23e-10 ***
## ln_extra_inc -0.012561 0.002928 -4.290 1.79e-05 ***
## inc_bef_tax_st -0.339916 0.136506 -2.490 0.012770 *
## ln_inventories -0.013441 0.001920 -7.000 2.57e-12 ***
## ln_material_exp 0.058277 0.005879 9.913 < 2e-16 ***
## ln_personnel_exp -0.028260 0.002808 -10.063 < 2e-16 ***
## profit_loss_year_st 0.933886 0.161737 5.774 7.74e-09 ***
## ln_subscribed_cap -0.008955 0.003856 -2.322 0.020209 *
## flag_error 0.120422 0.212418 0.567 0.570775
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 132683 on 102398 degrees of freedom
## Residual deviance: 131861 on 102380 degrees of freedom
## (212 observations deleted due to missingness)
## AIC: 131899
##
## Number of Fisher Scoring iterations: 4
# Check model X2
glm_modelx2 <- glm(formula(paste0("fast_growth_f ~", paste0(X2, collapse = " + "))),
data = data_filtered, family = "binomial")
summary(glm_modelx2)
##
## Call:
## glm(formula = formula(paste0("fast_growth_f ~", paste0(X2, collapse = " + "))),
## family = "binomial", data = data_filtered)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.4298 -0.9653 -0.8057 1.3228 2.2033
##
## Coefficients: (2 not defined because of singularities)
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 3.222e-01 3.018e-01 1.068 0.2857
## ln_amort 4.151e-03 3.474e-03 1.195 0.2321
## amort_flag -8.277e-01 4.847e-01 -1.708 0.0877 .
## ln_curr_assets -1.320e-02 6.364e-03 -2.074 0.0381 *
## ln_fixed_assets -3.326e-03 8.403e-03 -0.396 0.6923
## ln_intang_assets -6.528e-03 2.814e-03 -2.320 0.0203 *
## ln_liq_assets 2.083e-02 4.443e-03 4.689 2.74e-06 ***
## ln_tang_assets -1.739e-02 8.256e-03 -2.106 0.0352 *
## flag_asset_problem 2.033e-02 2.164e-01 0.094 0.9251
## ln_curr_liab 2.765e-02 3.782e-03 7.310 2.67e-13 ***
## ln_extra_exp -2.137e-02 3.825e-03 -5.588 2.30e-08 ***
## ln_extra_inc -1.365e-02 2.962e-03 -4.608 4.07e-06 ***
## inc_bef_tax_st -1.736e-01 1.383e-01 -1.255 0.2094
## ln_inventories -9.913e-03 1.945e-03 -5.098 3.44e-07 ***
## ln_material_exp 3.543e-02 5.938e-03 5.966 2.43e-09 ***
## ln_personnel_exp -2.426e-02 2.846e-03 -8.524 < 2e-16 ***
## profit_loss_year_st 8.789e-01 1.635e-01 5.375 7.67e-08 ***
## ln_subscribed_cap -1.880e-02 3.981e-03 -4.722 2.33e-06 ***
## flag_error 6.888e-02 2.142e-01 0.322 0.7478
## ceo_count2 2.892e-03 2.244e-02 0.129 0.8974
## ceo_count3 -2.101e-01 4.825e-02 -4.355 1.33e-05 ***
## ceo_count_flag -7.256e-02 2.096e-01 -0.346 0.7292
## foreign -8.625e-01 4.875e-01 -1.769 0.0768 .
## female -2.985e-01 3.929e-01 -0.760 0.4474
## inoffice_days -2.043e-04 4.553e-06 -44.884 < 2e-16 ***
## genderfemale 2.408e-01 2.028e-01 1.187 0.2351
## gendermale 4.411e-02 1.948e-01 0.226 0.8208
## gendermix NA NA NA NA
## originDomestic -3.868e-01 2.422e-01 -1.597 0.1102
## originForeign 4.350e-01 2.540e-01 1.712 0.0868 .
## originmix NA NA NA NA
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 132683 on 102398 degrees of freedom
## Residual deviance: 129517 on 102370 degrees of freedom
## (212 observations deleted due to missingness)
## AIC: 129575
##
## Number of Fisher Scoring iterations: 4
#calculate average marginal effects (dy/dx) for logit
mx2 <- margins(glm_modelx2, vce = "none")
sum_table <- summary(glm_modelx2) %>%
coef() %>%
as.data.frame() %>%
dplyr::select(Estimate) %>%
mutate(factor = row.names(.)) %>%
merge(summary(mx2)[,c("factor","AME")])
kable(x = sum_table, format = "html", digits = 3,
col.names = c("Variable", "Coefficient", "dy/dx"),
caption = "Average Marginal Effects (dy/dx) for Logit Model")
| Variable | Coefficient | dy/dx |
|---|---|---|
| amort_flag | -0.828 | -0.183 |
| ceo_count_flag | -0.073 | -0.016 |
| ceo_count2 | 0.003 | 0.001 |
| ceo_count3 | -0.210 | -0.045 |
| female | -0.299 | -0.066 |
| flag_asset_problem | 0.020 | 0.004 |
| flag_error | 0.069 | 0.015 |
| foreign | -0.863 | -0.191 |
| genderfemale | 0.241 | 0.054 |
| gendermale | 0.044 | 0.010 |
| inc_bef_tax_st | -0.174 | -0.038 |
| inoffice_days | 0.000 | 0.000 |
| ln_amort | 0.004 | 0.001 |
| ln_curr_assets | -0.013 | -0.003 |
| ln_curr_liab | 0.028 | 0.006 |
| ln_extra_exp | -0.021 | -0.005 |
| ln_extra_inc | -0.014 | -0.003 |
| ln_fixed_assets | -0.003 | -0.001 |
| ln_intang_assets | -0.007 | -0.001 |
| ln_inventories | -0.010 | -0.002 |
| ln_liq_assets | 0.021 | 0.005 |
| ln_material_exp | 0.035 | 0.008 |
| ln_personnel_exp | -0.024 | -0.005 |
| ln_subscribed_cap | -0.019 | -0.004 |
| ln_tang_assets | -0.017 | -0.004 |
| originDomestic | -0.387 | -0.087 |
| originForeign | 0.435 | 0.103 |
| profit_loss_year_st | 0.879 | 0.194 |
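For intuition on the dy/dx column: in a logit model, the average marginal effect (AME) of a continuous regressor equals its coefficient times the sample average of p(1-p). A minimal sketch on made-up data (not the firm-level dataset used in this report):

```r
# AME of a continuous regressor in a logit: beta * mean(p * (1 - p)).
# Toy data, purely illustrative.
set.seed(42)
x <- rnorm(500)
y <- rbinom(500, 1, plogis(0.5 + 0.8 * x))
fit <- glm(y ~ x, family = binomial)
p <- fitted(fit)
# matches, up to numerical error, the AME that margins() reports for x
ame_manual <- unname(coef(fit)["x"] * mean(p * (1 - p)))
```

Because mean(p * (1 - p)) is at most 0.25, the AME is always smaller in magnitude than the raw logit coefficient, which is why the dy/dx column shrinks every estimate toward zero.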
ols_modelx3 <- lm(formula(paste0("fast_growth ~", paste0(X3, collapse = " + "))),
data = data_filtered)
summary(ols_modelx3)
##
## Call:
## lm(formula = formula(paste0("fast_growth ~", paste0(X3, collapse = " + "))),
## data = data_filtered)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.6189 -0.3750 -0.2826 0.5845 1.0029
##
## Coefficients: (2 not defined because of singularities)
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.614e-01 6.365e-02 8.821 < 2e-16 ***
## ln_amort 8.787e-04 7.746e-04 1.134 0.25667
## amort_flag -1.605e-01 8.975e-02 -1.788 0.07374 .
## ln_curr_assets -3.179e-03 1.411e-03 -2.253 0.02429 *
## ln_fixed_assets -8.577e-04 1.882e-03 -0.456 0.64863
## ln_intang_assets -1.437e-03 6.142e-04 -2.341 0.01926 *
## ln_liq_assets 4.592e-03 9.786e-04 4.693 2.70e-06 ***
## ln_tang_assets -3.729e-03 1.849e-03 -2.017 0.04371 *
## flag_asset_problem 4.918e-03 4.784e-02 0.103 0.91812
## ln_curr_liab 5.977e-03 8.261e-04 7.235 4.69e-13 ***
## ln_extra_exp -4.438e-03 8.200e-04 -5.412 6.25e-08 ***
## ln_extra_inc -2.733e-03 6.394e-04 -4.274 1.92e-05 ***
## inc_bef_tax_st -4.196e-02 3.001e-02 -1.398 0.16205
## ln_inventories -2.128e-03 4.334e-04 -4.911 9.09e-07 ***
## ln_material_exp 7.784e-03 1.316e-03 5.917 3.30e-09 ***
## ln_personnel_exp -5.391e-03 6.408e-04 -8.413 < 2e-16 ***
## profit_loss_year_st 2.006e-01 3.562e-02 5.631 1.80e-08 ***
## ln_subscribed_cap -4.233e-03 8.739e-04 -4.844 1.28e-06 ***
## flag_error 1.855e-02 4.571e-02 0.406 0.68481
## ceo_count2 -1.379e-03 4.963e-03 -0.278 0.78118
## ceo_count3 -4.676e-02 1.028e-02 -4.549 5.38e-06 ***
## ceo_count_flag -1.476e-02 4.428e-02 -0.333 0.73888
## foreign -1.684e-01 1.022e-01 -1.649 0.09924 .
## female -6.698e-02 8.144e-02 -0.822 0.41084
## inoffice_days -4.199e-05 9.178e-07 -45.752 < 2e-16 ***
## genderfemale 5.261e-02 4.212e-02 1.249 0.21159
## gendermale 7.562e-03 4.038e-02 0.187 0.85146
## gendermix NA NA NA NA
## originDomestic -7.513e-02 5.094e-02 -1.475 0.14029
## originForeign 8.542e-02 5.324e-02 1.605 0.10861
## originmix NA NA NA NA
## urban_mother big -1.449e-02 4.434e-03 -3.268 0.00109 **
## urban_mother -3.872e-03 4.102e-03 -0.944 0.34521
## region_mEast -8.664e-03 4.082e-03 -2.123 0.03380 *
## region_mWest -1.955e-02 4.669e-03 -4.188 2.81e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.47 on 102366 degrees of freedom
## (212 observations deleted due to missingness)
## Multiple R-squared: 0.03, Adjusted R-squared: 0.02969
## F-statistic: 98.92 on 32 and 102366 DF, p-value: < 2.2e-16
glm_modelx3 <- glm(formula(paste0("fast_growth_f ~", paste0(X3, collapse = " + "))),
data = data_filtered, family = "binomial")
summary(glm_modelx3)
##
## Call:
## glm(formula = formula(paste0("fast_growth_f ~", paste0(X3, collapse = " + "))),
## family = "binomial", data = data_filtered)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.4344 -0.9653 -0.8037 1.3214 2.2015
##
## Coefficients: (2 not defined because of singularities)
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 3.882e-01 3.021e-01 1.285 0.1988
## ln_amort 4.356e-03 3.475e-03 1.254 0.2100
## amort_flag -8.593e-01 4.850e-01 -1.772 0.0764 .
## ln_curr_assets -1.423e-02 6.367e-03 -2.235 0.0254 *
## ln_fixed_assets -3.395e-03 8.408e-03 -0.404 0.6864
## ln_intang_assets -6.470e-03 2.815e-03 -2.298 0.0215 *
## ln_liq_assets 2.087e-02 4.445e-03 4.696 2.65e-06 ***
## ln_tang_assets -1.640e-02 8.263e-03 -1.984 0.0472 *
## flag_asset_problem 2.017e-02 2.163e-01 0.093 0.9257
## ln_curr_liab 2.693e-02 3.783e-03 7.119 1.09e-12 ***
## ln_extra_exp -2.088e-02 3.826e-03 -5.457 4.85e-08 ***
## ln_extra_inc -1.297e-02 2.965e-03 -4.373 1.23e-05 ***
## inc_bef_tax_st -1.814e-01 1.383e-01 -1.312 0.1895
## ln_inventories -9.263e-03 1.948e-03 -4.755 1.98e-06 ***
## ln_material_exp 3.349e-02 5.951e-03 5.627 1.83e-08 ***
## ln_personnel_exp -2.359e-02 2.850e-03 -8.277 < 2e-16 ***
## profit_loss_year_st 9.156e-01 1.637e-01 5.594 2.22e-08 ***
## ln_subscribed_cap -2.020e-02 3.990e-03 -5.062 4.14e-07 ***
## flag_error 6.079e-02 2.144e-01 0.283 0.7768
## ceo_count2 -7.427e-04 2.246e-02 -0.033 0.9736
## ceo_count3 -2.149e-01 4.829e-02 -4.450 8.61e-06 ***
## ceo_count_flag -6.813e-02 2.096e-01 -0.325 0.7452
## foreign -8.260e-01 4.876e-01 -1.694 0.0903 .
## female -3.120e-01 3.926e-01 -0.795 0.4268
## inoffice_days -2.037e-04 4.552e-06 -44.749 < 2e-16 ***
## genderfemale 2.466e-01 2.026e-01 1.217 0.2235
## gendermale 3.447e-02 1.946e-01 0.177 0.8594
## gendermix NA NA NA NA
## originDomestic -3.672e-01 2.422e-01 -1.516 0.1296
## originForeign 4.192e-01 2.541e-01 1.650 0.0990 .
## originmix NA NA NA NA
## urban_mother big -6.446e-02 2.004e-02 -3.217 0.0013 **
## urban_mother -1.572e-02 1.847e-02 -0.851 0.3949
## region_mEast -3.921e-02 1.851e-02 -2.119 0.0341 *
## region_mWest -8.996e-02 2.137e-02 -4.209 2.56e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 132683 on 102398 degrees of freedom
## Residual deviance: 129466 on 102366 degrees of freedom
## (212 observations deleted due to missingness)
## AIC: 129532
##
## Number of Fisher Scoring iterations: 4
ols_modelx4 <- lm(formula(paste0("fast_growth ~", paste0(X4, collapse = " + "))),
data = data_filtered)
summary(ols_modelx4)
##
## Call:
## lm(formula = formula(paste0("fast_growth ~", paste0(X4, collapse = " + "))),
## data = data_filtered)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.7937 -0.3563 -0.2719 0.5911 0.9990
##
## Coefficients: (2 not defined because of singularities)
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.277e-01 6.957e-02 7.585 3.35e-14 ***
## ln_amort 1.949e-03 7.710e-04 2.528 0.011466 *
## amort_flag -1.241e-01 8.895e-02 -1.395 0.162892
## ln_curr_assets -2.446e-03 1.450e-03 -1.687 0.091586 .
## ln_fixed_assets -6.977e-04 1.867e-03 -0.374 0.708662
## ln_intang_assets -1.402e-03 6.109e-04 -2.294 0.021771 *
## ln_liq_assets 3.945e-03 9.755e-04 4.044 5.26e-05 ***
## ln_tang_assets -3.505e-03 1.833e-03 -1.912 0.055884 .
## flag_asset_problem 8.651e-03 4.743e-02 0.182 0.855293
## ln_curr_liab 4.762e-03 8.217e-04 5.795 6.85e-09 ***
## ln_extra_exp -4.053e-03 8.132e-04 -4.984 6.24e-07 ***
## ln_extra_inc -3.288e-03 6.370e-04 -5.163 2.44e-07 ***
## inc_bef_tax_st -4.212e-02 2.986e-02 -1.410 0.158410
## ln_inventories -1.503e-03 4.421e-04 -3.398 0.000678 ***
## ln_material_exp 9.046e-03 1.312e-03 6.895 5.41e-12 ***
## ln_personnel_exp -4.422e-03 6.372e-04 -6.939 3.97e-12 ***
## profit_loss_year_st 2.017e-01 3.533e-02 5.708 1.14e-08 ***
## ln_subscribed_cap -5.034e-03 8.738e-04 -5.760 8.42e-09 ***
## flag_error 1.117e-02 4.531e-02 0.247 0.805262
## ceo_count2 3.018e-03 4.922e-03 0.613 0.539777
## ceo_count3 -4.069e-02 1.020e-02 -3.991 6.57e-05 ***
## ceo_count_flag -1.368e-02 4.389e-02 -0.312 0.755347
## foreign -1.751e-01 1.013e-01 -1.728 0.083905 .
## female -5.775e-02 8.073e-02 -0.715 0.474429
## inoffice_days -3.374e-05 9.429e-07 -35.781 < 2e-16 ***
## genderfemale 4.761e-02 4.175e-02 1.140 0.254094
## gendermale 1.282e-02 4.003e-02 0.320 0.748862
## gendermix NA NA NA NA
## originDomestic -8.153e-02 5.050e-02 -1.614 0.106425
## originForeign 8.254e-02 5.278e-02 1.564 0.117816
## originmix NA NA NA NA
## urban_mother big -1.478e-02 4.412e-03 -3.350 0.000808 ***
## urban_mother -3.548e-03 4.085e-03 -0.868 0.385147
## region_mEast -8.864e-03 4.055e-03 -2.186 0.028806 *
## region_mWest -1.907e-02 4.638e-03 -4.112 3.93e-05 ***
## year2011 6.759e-03 5.475e-03 1.235 0.216982
## year2012 4.183e-02 5.357e-03 7.809 5.82e-15 ***
## year2013 4.402e-02 5.280e-03 8.337 < 2e-16 ***
## year2014 4.686e-02 5.328e-03 8.794 < 2e-16 ***
## year2015 5.511e-02 5.431e-03 10.147 < 2e-16 ***
## ind2_cat26 -7.682e-02 2.967e-02 -2.589 0.009617 **
## ind2_cat27 -6.822e-02 3.006e-02 -2.269 0.023242 *
## ind2_cat28 -8.422e-02 2.938e-02 -2.867 0.004143 **
## ind2_cat29 -3.986e-03 3.124e-02 -0.128 0.898447
## ind2_cat30 7.559e-03 3.291e-02 0.230 0.818332
## ind2_cat32 -8.025e-02 3.370e-02 -2.381 0.017244 *
## ind2_cat33 -5.575e-02 2.940e-02 -1.896 0.057946 .
## ind2_cat40 7.844e-03 3.241e-02 0.242 0.808776
## ind2_cat55 -5.758e-02 2.938e-02 -1.960 0.050035 .
## ind2_cat56 -6.403e-02 2.915e-02 -2.197 0.028025 *
## ind2_cat60 -1.684e-05 3.215e-02 -0.001 0.999582
## ind2_cat99 -2.332e-01 8.521e-02 -2.737 0.006203 **
## imputed1 2.115e-01 5.666e-03 37.323 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4658 on 102348 degrees of freedom
## (212 observations deleted due to missingness)
## Multiple R-squared: 0.04744, Adjusted R-squared: 0.04698
## F-statistic: 102 on 50 and 102348 DF, p-value: < 2.2e-16
glm_modelx4 <- glm(formula(paste0("fast_growth_f ~", paste0(X4, collapse = " + "))),
data = data_filtered, family = "binomial")
summary(glm_modelx4)
##
## Call:
## glm(formula = formula(paste0("fast_growth_f ~", paste0(X4, collapse = " + "))),
## family = "binomial", data = data_filtered)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.7545 -0.9343 -0.7891 1.3323 2.1940
##
## Coefficients: (2 not defined because of singularities)
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 2.433e-01 3.308e-01 0.735 0.462167
## ln_amort 9.184e-03 3.532e-03 2.600 0.009326 **
## amort_flag -7.111e-01 4.869e-01 -1.460 0.144184
## ln_curr_assets -1.142e-02 6.655e-03 -1.717 0.086019 .
## ln_fixed_assets -2.948e-03 8.501e-03 -0.347 0.728768
## ln_intang_assets -6.389e-03 2.844e-03 -2.246 0.024700 *
## ln_liq_assets 1.829e-02 4.505e-03 4.059 4.92e-05 ***
## ln_tang_assets -1.578e-02 8.350e-03 -1.890 0.058783 .
## flag_asset_problem 4.066e-02 2.175e-01 0.187 0.851725
## ln_curr_liab 2.209e-02 3.828e-03 5.769 7.95e-09 ***
## ln_extra_exp -1.925e-02 3.845e-03 -5.007 5.53e-07 ***
## ln_extra_inc -1.547e-02 2.999e-03 -5.158 2.49e-07 ***
## inc_bef_tax_st -1.847e-01 1.396e-01 -1.323 0.185791
## ln_inventories -6.635e-03 2.025e-03 -3.276 0.001053 **
## ln_material_exp 3.989e-02 6.050e-03 6.593 4.31e-11 ***
## ln_personnel_exp -1.996e-02 2.892e-03 -6.900 5.18e-12 ***
## profit_loss_year_st 9.307e-01 1.647e-01 5.651 1.59e-08 ***
## ln_subscribed_cap -2.385e-02 4.051e-03 -5.886 3.97e-09 ***
## flag_error 2.795e-02 2.166e-01 0.129 0.897317
## ceo_count2 1.770e-02 2.265e-02 0.782 0.434453
## ceo_count3 -1.916e-01 4.866e-02 -3.937 8.24e-05 ***
## ceo_count_flag -6.923e-02 2.103e-01 -0.329 0.742060
## foreign -8.605e-01 4.885e-01 -1.761 0.078182 .
## female -2.834e-01 3.955e-01 -0.717 0.473657
## inoffice_days -1.651e-04 4.667e-06 -35.383 < 2e-16 ***
## genderfemale 2.309e-01 2.041e-01 1.131 0.257970
## gendermale 5.366e-02 1.960e-01 0.274 0.784309
## gendermix NA NA NA NA
## originDomestic -3.992e-01 2.428e-01 -1.644 0.100118
## originForeign 4.078e-01 2.546e-01 1.601 0.109274
## originmix NA NA NA NA
## urban_mother big -6.720e-02 2.031e-02 -3.309 0.000937 ***
## urban_mother -1.476e-02 1.873e-02 -0.788 0.430656
## region_mEast -4.022e-02 1.872e-02 -2.149 0.031647 *
## region_mWest -8.890e-02 2.159e-02 -4.117 3.84e-05 ***
## year2011 3.384e-02 2.614e-02 1.295 0.195482
## year2012 1.982e-01 2.521e-02 7.862 3.77e-15 ***
## year2013 2.064e-01 2.479e-02 8.325 < 2e-16 ***
## year2014 2.177e-01 2.501e-02 8.704 < 2e-16 ***
## year2015 2.545e-01 2.541e-02 10.016 < 2e-16 ***
## ind2_cat26 -3.479e-01 1.334e-01 -2.607 0.009128 **
## ind2_cat27 -3.076e-01 1.353e-01 -2.273 0.023019 *
## ind2_cat28 -3.827e-01 1.320e-01 -2.899 0.003747 **
## ind2_cat29 -1.927e-02 1.403e-01 -0.137 0.890732
## ind2_cat30 3.186e-02 1.475e-01 0.216 0.829017
## ind2_cat32 -3.659e-01 1.542e-01 -2.373 0.017657 *
## ind2_cat33 -2.508e-01 1.320e-01 -1.899 0.057540 .
## ind2_cat40 3.048e-02 1.453e-01 0.210 0.833837
## ind2_cat55 -2.606e-01 1.320e-01 -1.974 0.048390 *
## ind2_cat56 -2.892e-01 1.309e-01 -2.210 0.027099 *
## ind2_cat60 -3.308e-03 1.442e-01 -0.023 0.981701
## ind2_cat99 -1.195e+00 4.702e-01 -2.542 0.011025 *
## imputed1 8.633e-01 2.513e-02 34.357 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 132683 on 102398 degrees of freedom
## Residual deviance: 127800 on 102348 degrees of freedom
## (212 observations deleted due to missingness)
## AIC: 127902
##
## Number of Fisher Scoring iterations: 4
#calculate average marginal effects (dy/dx) for logit
# vce="none" makes it run much faster, here we do not need variances
m <- margins(glm_modelx4, vce = "none")
sum_table2 <- summary(glm_modelx4) %>%
coef() %>%
as.data.frame() %>%
dplyr::select(Estimate, `Std. Error`) %>%
mutate(factor = row.names(.)) %>%
merge(summary(m)[,c("factor","AME")])
kable(x = sum_table2, format = "html", digits = 3,
col.names = c("Variable", "Coefficient", "SE", "dy/dx"),
caption = "Average Marginal Effects (dy/dx) for Logit Model")
| Variable | Coefficient | SE | dy/dx |
|---|---|---|---|
| amort_flag | -0.711 | 0.487 | -0.154 |
| ceo_count_flag | -0.069 | 0.210 | -0.015 |
| ceo_count2 | 0.018 | 0.023 | 0.004 |
| ceo_count3 | -0.192 | 0.049 | -0.040 |
| female | -0.283 | 0.396 | -0.061 |
| flag_asset_problem | 0.041 | 0.218 | 0.009 |
| flag_error | 0.028 | 0.217 | 0.006 |
| foreign | -0.860 | 0.489 | -0.187 |
| genderfemale | 0.231 | 0.204 | 0.051 |
| gendermale | 0.054 | 0.196 | 0.011 |
| imputed1 | 0.863 | 0.025 | 0.203 |
| inc_bef_tax_st | -0.185 | 0.140 | -0.040 |
| ind2_cat26 | -0.348 | 0.133 | -0.078 |
| ind2_cat27 | -0.308 | 0.135 | -0.069 |
| ind2_cat28 | -0.383 | 0.132 | -0.085 |
| ind2_cat29 | -0.019 | 0.140 | -0.004 |
| ind2_cat30 | 0.032 | 0.148 | 0.007 |
| ind2_cat32 | -0.366 | 0.154 | -0.081 |
| ind2_cat33 | -0.251 | 0.132 | -0.057 |
| ind2_cat40 | 0.030 | 0.145 | 0.007 |
| ind2_cat55 | -0.261 | 0.132 | -0.059 |
| ind2_cat56 | -0.289 | 0.131 | -0.065 |
| ind2_cat60 | -0.003 | 0.144 | -0.001 |
| ind2_cat99 | -1.195 | 0.470 | -0.230 |
| inoffice_days | 0.000 | 0.000 | 0.000 |
| ln_amort | 0.009 | 0.004 | 0.002 |
| ln_curr_assets | -0.011 | 0.007 | -0.002 |
| ln_curr_liab | 0.022 | 0.004 | 0.005 |
| ln_extra_exp | -0.019 | 0.004 | -0.004 |
| ln_extra_inc | -0.015 | 0.003 | -0.003 |
| ln_fixed_assets | -0.003 | 0.009 | -0.001 |
| ln_intang_assets | -0.006 | 0.003 | -0.001 |
| ln_inventories | -0.007 | 0.002 | -0.001 |
| ln_liq_assets | 0.018 | 0.005 | 0.004 |
| ln_material_exp | 0.040 | 0.006 | 0.009 |
| ln_personnel_exp | -0.020 | 0.003 | -0.004 |
| ln_subscribed_cap | -0.024 | 0.004 | -0.005 |
| ln_tang_assets | -0.016 | 0.008 | -0.003 |
| originDomestic | -0.399 | 0.243 | -0.088 |
| originForeign | 0.408 | 0.255 | 0.095 |
| profit_loss_year_st | 0.931 | 0.165 | 0.202 |
| region_mEast | -0.040 | 0.019 | -0.009 |
| region_mWest | -0.089 | 0.022 | -0.019 |
| urban_mother | -0.015 | 0.019 | -0.003 |
| urban_mother big | -0.067 | 0.020 | -0.015 |
| year2011 | 0.034 | 0.026 | 0.007 |
| year2012 | 0.198 | 0.025 | 0.042 |
| year2013 | 0.206 | 0.025 | 0.044 |
| year2014 | 0.218 | 0.025 | 0.047 |
| year2015 | 0.254 | 0.025 | 0.055 |
Now I divide the dataset into two parts: 80% for training and 20% for the holdout set.
set.seed(1234)
data_filtered <- data_filtered %>%
drop_na(flag_asset_problem, flag_error)
data_filtered <- data_filtered %>%
mutate(fast_growth_f = factor(fast_growth_f,
labels = make.names(levels(fast_growth_f))))
train_indices <- as.integer(createDataPartition(data_filtered$fast_growth_f, p = 0.8, list = FALSE))
data_train <- data_filtered[train_indices, ]
data_holdout <- data_filtered[-train_indices, ]
dim(data_train)
## [1] 81920 87
dim(data_holdout)
## [1] 20479 87
# 5-fold cross-validation
train_control <- trainControl(
method = "cv",
number = 5,
classProbs = TRUE,
summaryFunction = twoClassSummaryExtended,
savePredictions = TRUE)
logit_model_vars <- list("X1" = X1, "X2" = X2, "X3" = X3, "X4" = X4, "X5" = X5, "X6" = X6)
CV_RMSE_folds <- list()
logit_models <- list()
for (model_name in names(logit_model_vars)) {
features <- logit_model_vars[[model_name]]
set.seed(1234)
glm_model <- train(
formula(paste0("fast_growth_f ~", paste0(features, collapse = " + "))),
method = "glm",
data = data_train,
family = binomial,
trControl = train_control
)
logit_models[[model_name]] <- glm_model
# Calculate RMSE on test for each fold
CV_RMSE_folds[[model_name]] <- glm_model$resample[,c("Resample", "RMSE")]
}
# Logit lasso -----------------------------------------------------------
lambda <- 10^seq(-1, -4, length = 10)
grid <- expand.grid("alpha" = 1, lambda = lambda)
set.seed(1234)
system.time({
logit_lasso_model <- train(
formula(paste0("fast_growth_f ~", paste0(logitvars, collapse = " + "))),
data = data_train,
method = "glmnet",
preProcess = c("center", "scale"),
family = "binomial",
trControl = train_control,
tuneGrid = grid,
na.action = na.exclude
)
})
## user system elapsed
## 115.098 3.702 120.353
tuned_logit_lasso_model <- logit_lasso_model$finalModel
best_lambda <- logit_lasso_model$bestTune$lambda
logit_models[["LASSO"]] <- logit_lasso_model
lasso_coeffs <- as.matrix(coef(tuned_logit_lasso_model, best_lambda))
CV_RMSE_folds[["LASSO"]] <- logit_lasso_model$resample[,c("Resample", "RMSE")]
# Draw ROC Curve and calculate AUC for each folds --------------------------------
CV_AUC_folds <- list()
for (model_name in names(logit_models)) {
auc <- list()
model <- logit_models[[model_name]]
for (fold in c("Fold1", "Fold2", "Fold3", "Fold4", "Fold5")) {
cv_fold <-
model$pred %>%
filter(Resample == fold)
roc_obj <- roc(cv_fold$obs, cv_fold$X1)
auc[[fold]] <- as.numeric(roc_obj$auc)
}
CV_AUC_folds[[model_name]] <- data.frame("Resample" = names(auc),
"AUC" = unlist(auc))
}
# For each model: average RMSE and average AUC for models ----------------------------------
CV_RMSE <- list()
CV_AUC <- list()
for (model_name in names(logit_models)) {
CV_RMSE[[model_name]] <- mean(CV_RMSE_folds[[model_name]]$RMSE)
CV_AUC[[model_name]] <- mean(CV_AUC_folds[[model_name]]$AUC)
}
# 7 models (6 logit and the logit LASSO). For each we have 5-fold CV RMSE and AUC.
# We pick our preferred model based on these. -----------------------------------------------
nvars <- lapply(logit_models, FUN = function(x) length(x$coefnames))
nvars[["LASSO"]] <- sum(lasso_coeffs != 0)
logit_summary1 <- data.frame("Number of predictors" = unlist(nvars),
"CV RMSE" = unlist(CV_RMSE),
"CV AUC" = unlist(CV_AUC))
kable(x = logit_summary1, format = "html", booktabs = TRUE, digits = 3, row.names = TRUE,
linesep = "", col.names = c("Number of predictors","CV RMSE","CV AUC"))
| | Number of predictors | CV RMSE | CV AUC |
|---|---|---|---|
| X1 | 18 | 0.475 | 0.554 |
| X2 | 30 | 0.470 | 0.608 |
| X3 | 34 | 0.470 | 0.609 |
| X4 | 52 | 0.466 | 0.623 |
| X5 | 67 | 0.464 | 0.634 |
| X6 | 115 | 0.463 | 0.636 |
| LASSO | 102 | 0.463 | 0.616 |
Based on this information, I choose model X6. Even though the difference between models X4 and X6 in CV RMSE is not large, X6's CV AUC is noticeably better. Interestingly, LASSO matches X6 on CV RMSE but performs considerably worse on CV AUC.
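To make the two selection metrics concrete: CV RMSE here measures the gap between predicted probabilities and the 0/1 outcome, while AUC measures ranking quality, i.e. the share of positive-negative pairs the model orders correctly. A hand-computed toy example with made-up predictions (base R only):

```r
# Six made-up observations and predicted probabilities, for illustration only
obs  <- c(0, 0, 0, 1, 1, 1)
pred <- c(0.20, 0.40, 0.10, 0.35, 0.80, 0.70)

# RMSE of probability predictions against the binary outcome
rmse <- sqrt(mean((pred - obs)^2))

# AUC as the fraction of (positive, negative) pairs ranked correctly,
# counting ties as half -- the same quantity pROC::roc() computes
pos <- pred[obs == 1]
neg <- pred[obs == 0]
auc <- mean(outer(pos, neg, ">") + 0.5 * outer(pos, neg, "=="))

c(rmse = rmse, auc = auc)  # rmse ~ 0.356, auc = 8/9
```

The two metrics can disagree, as with LASSO above: a model can have well-sized probabilities on average (low RMSE) while ordering individual firms less accurately (lower AUC).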
best_logit_no_loss <- logit_models[["X6"]]
logit_predicted_probabilities_holdout <- predict(best_logit_no_loss, newdata = data_holdout, type = "prob")
## Warning in predict.lm(object, newdata, se.fit, scale = 1, type = if (type == :
## prediction from a rank-deficient fit may be misleading
data_holdout[,"best_logit_no_loss_pred"] <- logit_predicted_probabilities_holdout[,"X1"]
RMSE(data_holdout[, "best_logit_no_loss_pred", drop=TRUE], data_holdout$fast_growth)
## [1] 0.4646898
# discrete ROC (with thresholds in steps)
thresholds <- seq(0.05, 0.75, by = 0.05)
cm <- list()
true_positive_rates <- c()
false_positive_rates <- c()
for (thr in thresholds) {
holdout_prediction <- ifelse(data_holdout[,"best_logit_no_loss_pred"] < thr, "X0", "X1") %>%
factor(levels = c("X0", "X1"))
cm_thr <- confusionMatrix(holdout_prediction, data_holdout$fast_growth_f)$table
cm[[as.character(thr)]] <- cm_thr
true_positive_rates <- c(true_positive_rates, cm_thr["X1", "X1"] /
(cm_thr["X1", "X1"] + cm_thr["X0", "X1"]))
false_positive_rates <- c(false_positive_rates, cm_thr["X1", "X0"] /
(cm_thr["X1", "X0"] + cm_thr["X0", "X0"]))
}
tpr_fpr_for_thresholds <- tibble(
"threshold" = thresholds,
"true_positive_rate" = true_positive_rates,
"false_positive_rate" = false_positive_rates
)
discrete_roc_plot <- ggplot(
data = tpr_fpr_for_thresholds,
aes(x = false_positive_rate, y = true_positive_rate, color = threshold)) +
labs(x = "False positive rate (1 - Specificity)", y = "True positive rate (Sensitivity)") +
geom_point(size=2, alpha=0.8) +
scale_color_viridis(option = "D", direction = -1) +
scale_x_continuous(expand = c(0.01,0.01), limit=c(0,1), breaks = seq(0,1,0.1)) +
scale_y_continuous(expand = c(0.01,0.01), limit=c(0,1), breaks = seq(0,1,0.1)) +
theme_bg() +
theme(legend.position ="right") +
theme(legend.title = element_text(size = 4),
legend.text = element_text(size = 4),
legend.key.size = unit(.4, "cm"))
discrete_roc_plot
# continuous ROC on holdout with best model (Logit X6) -------------------------------------------
roc_obj_holdout <- roc(data_holdout$fast_growth_f, data_holdout$best_logit_no_loss_pred)
## Setting levels: control = X0, case = X1
## Setting direction: controls < cases
createRocPlot(roc_obj_holdout, "best_logit_no_loss_roc_plot_holdout")
## Warning: It is deprecated to specify `guide = FALSE` to remove a guide. Please
## use `guide = "none"` instead.
# Confusion table with different tresholds ----------------------------------------------------------
# default: the threshold 0.5 is used to convert probabilities to binary classes
logit_class_prediction <- predict(best_logit_no_loss, newdata = data_holdout)
## Warning in predict.lm(object, newdata, se.fit, scale = 1, type = if (type == :
## prediction from a rank-deficient fit may be misleading
summary(logit_class_prediction)
## X0 X1
## 18521 1958
# confusion matrix: summarize different type of errors and successfully predicted cases
# positive = "yes": explicitly specify the positive case
cm_object1 <- confusionMatrix(logit_class_prediction, data_holdout$fast_growth_f, positive = "X1")
cm1 <- cm_object1$table
cm1
## Reference
## Prediction X0 X1
## X0 12524 5997
## X1 773 1185
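These counts show how conservative the default 0.5 cut-off is: most fast-growing firms are missed. The headline rates can be read straight off the table:

```r
# Counts from the confusion table above (positive class = X1)
tp <- 1185; fn <- 5997; tn <- 12524; fp <- 773
tp / (tp + fn)                    # sensitivity ~ 0.165
tn / (tn + fp)                    # specificity ~ 0.942
(tp + tn) / (tp + fn + tn + fp)   # accuracy    ~ 0.669
```

With only about 16.5% of true fast-growers caught, the 0.5 threshold is clearly too high for this roughly 35%-prevalence outcome, which motivates the alternative thresholds tried next.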
# we can apply different thresholds
# 0.4
holdout_prediction <-
ifelse(data_holdout$best_logit_no_loss_pred < 0.4, "X0", "X1") %>%
factor(levels = c("X0", "X1"))
cm_object1b <- confusionMatrix(holdout_prediction, data_holdout$fast_growth_f)
cm1b <- cm_object1b$table
cm1b
## Reference
## Prediction X0 X1
## X0 10455 4419
## X1 2842 2763
# a sensible choice: mean of predicted probabilities
mean_predicted_default_prob <- mean(data_holdout$best_logit_no_loss_pred)
mean_predicted_default_prob
## [1] 0.3507365
holdout_prediction <-
ifelse(data_holdout$best_logit_no_loss_pred < mean_predicted_default_prob, "X0", "X1") %>%
factor(levels = c("X0", "X1"))
cm_object2 <- confusionMatrix(holdout_prediction, data_holdout$fast_growth_f)
cm2 <- cm_object2$table
cm2
## Reference
## Prediction X0 X1
## X0 7965 2960
## X1 5332 4222
create_calibration_plot(data_holdout,
prob_var = "best_logit_no_loss_pred",
actual_var = "fast_growth",
n_bins = 15)
## Warning: Removed 2 row(s) containing missing values (geom_path).
Let's assume my business problem is the following: I need to know whether a firm is going to be fast-growing, because my client wants to invest only in fast-growing firms (to buy their shares, for example). In terms of risk aversion, I know that my client is very risk-averse and does not want to lose money or earn low returns on these investments (he has good alternative investment options). At the same time, he wants to multiply his wealth, so failing to capture successful companies will leave him somewhat disappointed. Bearing this in mind, I specify my loss function as follows:
The cost of a False Positive (predicting that the firm is fast-growing when in fact it is not) = 9; the cost of a False Negative (predicting that the firm is not fast-growing when in fact it is) = 3.
FP <- 9
FN <- 3
cost = FN/FP # 0.33
prevelance = sum(data_train$fast_growth)/length(data_train$fast_growth) # 0.3506836
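A useful benchmark for the optimal thresholds below: if predicted probabilities are well calibrated, the loss-minimising threshold is FP/(FP + FN) = 9/12 = 0.75. A minimal sketch, assuming for illustration that the predicted probability p is Uniform(0, 1) so the expected loss has a closed form:

```r
FP <- 9; FN <- 3
# With calibrated p ~ Uniform(0, 1), the expected per-observation loss at
# threshold t is
#   FP * E[(1 - p) * 1{p >= t}] + FN * E[p * 1{p < t}]
#     = FP * (1 - t)^2 / 2 + FN * t^2 / 2
expected_loss <- function(t) FP * (1 - t)^2 / 2 + FN * t^2 / 2
thresholds <- seq(0.05, 0.95, by = 0.05)
thresholds[which.min(expected_loss(thresholds))]  # 0.75 = FP / (FP + FN)
```

The CV-optimal thresholds of roughly 0.70-0.73 found below for the richer models sit close to this benchmark, which is reassuring about their calibration.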
In some folds, the optimal threshold came back as Inf. If I understood correctly, the main reason is that in those folds all observations were classified as negatives (since my loss function punishes FP quite severely), so no finite optimal threshold could be found. For this and the following tasks, I replaced Inf values with NA.
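The imputation itself is a one-liner; a toy illustration with hypothetical fold thresholds:

```r
# Hypothetical per-fold optimal thresholds; one fold returned Inf
fold_thresholds <- c(0.55, 0.60, Inf, 0.70, 0.72)
fold_thresholds[is.infinite(fold_thresholds)] <- NA  # impute Inf with NA
mean(fold_thresholds, na.rm = TRUE)                  # 0.6425, Inf fold excluded
```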
# Draw ROC Curve and find optimal threshold with loss function --------------------------
best_tresholds <- list()
expected_loss <- list()
logit_cv_rocs <- list()
logit_cv_threshold <- list()
logit_cv_expected_loss <- list()
for (model_name in names(logit_models)) {
model <- logit_models[[model_name]]
colname <- paste0(model_name,"_prediction")
best_tresholds_cv <- list()
expected_loss_cv <- list()
for (fold in c("Fold1", "Fold2", "Fold3", "Fold4", "Fold5")) {
cv_fold <-
model$pred %>%
filter(Resample == fold)
roc_obj <- roc(cv_fold$obs, cv_fold$X1)
best_treshold <- coords(roc_obj, "best", ret="all", transpose = FALSE,
best.method = "youden", best.weights = c(cost, prevelance))
best_tresholds_cv[[fold]] <- best_treshold$threshold
expected_loss_cv[[fold]] <- (best_treshold$fp*FP + best_treshold$fn*FN)/length(cv_fold$X1)
}
# average
best_tresholds_cv <- unlist(best_tresholds_cv)
best_tresholds_cv[is.infinite(best_tresholds_cv)] <- NA
best_tresholds[[model_name]] <- mean(best_tresholds_cv, na.rm = T)
expected_loss[[model_name]] <- mean(unlist(expected_loss_cv))
# for fold #5
logit_cv_rocs[[model_name]] <- roc_obj
logit_cv_threshold[[model_name]] <- best_treshold
logit_cv_expected_loss[[model_name]] <- expected_loss_cv[[fold]]
}
logit_summary2 <- data.frame("Avg of optimal thresholds" = unlist(best_tresholds),
"Threshold for Fold5" = sapply(logit_cv_threshold, function(x) {x$threshold}),
"Avg expected loss" = unlist(expected_loss),
"Expected loss for Fold5" = unlist(logit_cv_expected_loss))
kable(x = logit_summary2, format = "html", booktabs=TRUE, digits = 3, row.names = TRUE,
linesep = "", col.names = c("Avg of optimal thresholds","Threshold for fold #5",
"Avg expected loss","Expected loss for fold #5"))
| | Avg of optimal thresholds | Threshold for fold #5 | Avg expected loss | Expected loss for fold #5 |
|---|---|---|---|---|
| X1 | 0.549 | 0.536 | 1.051 | 1.051 |
| X2 | 0.598 | Inf | 1.052 | 1.052 |
| X3 | 0.587 | Inf | 1.052 | 1.052 |
| X4 | 0.706 | 0.696 | 1.051 | 1.051 |
| X5 | 0.730 | 0.737 | 1.050 | 1.051 |
| X6 | 0.721 | 0.663 | 1.049 | 1.048 |
| LASSO | 0.722 | 0.666 | 1.051 | 1.052 |
Based on the average expected loss, I again choose model X6.
best_logit_with_loss <- logit_models[["X6"]]
best_logit_optimal_treshold <- best_tresholds[["X6"]]
logit_predicted_probabilities_holdout <- predict(best_logit_with_loss, newdata = data_holdout, type = "prob")
data_holdout[,"best_logit_with_loss_pred"] <- logit_predicted_probabilities_holdout[,"X1"]
# ROC curve on holdout
roc_obj_holdout <- roc(data_holdout$fast_growth, data_holdout[, "best_logit_with_loss_pred", drop=TRUE])
# Get expected loss on holdout
holdout_treshold <- coords(roc_obj_holdout, x = best_logit_optimal_treshold, input= "threshold",
ret="all", transpose = FALSE)
expected_loss_holdout <- (holdout_treshold$fp*FP + holdout_treshold$fn*FN)/length(data_holdout$fast_growth_f)
expected_loss_holdout
## [1] 1.052102
# Confusion table on holdout with optimal threshold
holdout_prediction <-
ifelse(data_holdout$best_logit_with_loss_pred < best_logit_optimal_treshold, "X0", "X1") %>%
factor(levels = c("X0", "X1"))
cm_object3 <- confusionMatrix(holdout_prediction, data_holdout$fast_growth_f)
cm3 <- cm_object3$table
cm3
## Reference
## Prediction X0 X1
## X0 13276 7119
## X1 21 63
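As a cross-check, the holdout expected loss reported above can be recomputed directly from this confusion table:

```r
# From the table above: 21 false positives (predicted X1, actual X0) and
# 7119 false negatives (predicted X0, actual X1), out of 20479 holdout firms
fp <- 21; fn <- 7119; n <- 20479
(fp * 9 + fn * 3) / n  # 1.052102, matching expected_loss_holdout
```

The high threshold makes the classifier very reluctant to predict fast growth, so almost the entire loss comes from false negatives, exactly as the asymmetric 9:3 cost structure intends.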
data_for_graph <- data_train
levels(data_for_graph$fast_growth_f) <- list("slow" = "X0", "fast" = "X1")
set.seed(1234)
rf_for_graph <-
rpart(
formula = formula(paste0("fast_growth_f ~ ", paste0(rfvars, collapse = " + "))),
data = data_for_graph,
control = rpart.control(cp = 0.0028, minbucket = 100)
)
rpart.plot(rf_for_graph, tweak=1, digits=2, extra=107, under = TRUE)
# 5-fold cross-validation
train_control <- trainControl(
method = "cv",
number = 5,
classProbs = TRUE, # same as probability = TRUE in ranger
summaryFunction = twoClassSummaryExtended,
savePredictions = TRUE
)
train_control$verboseIter <- TRUE
tune_grid <- expand.grid(
.mtry = c(5, 6, 7),
.splitrule = "gini",
.min.node.size = c(10, 15)
)
# getModelInfo("ranger")
set.seed(1234)
rf_model_p <- train(
formula(paste0("fast_growth_f ~ ", paste0(rfvars, collapse = " + "))),
method = "ranger",
data = data_train,
tuneGrid = tune_grid,
trControl = train_control
)
## + Fold1: mtry=5, splitrule=gini, min.node.size=10
## - Fold1: mtry=5, splitrule=gini, min.node.size=10
## ... (analogous verbose lines for the remaining parameter combinations and folds omitted)
## Aggregating results
## Selecting tuning parameters
## Fitting mtry = 7, splitrule = gini, min.node.size = 10 on full training set
rf_model_p$results
## mtry splitrule min.node.size Accuracy Kappa RMSE AccuracySD
## 1 5 gini 10 0.7209961 0.2850996 0.4248853 0.002461767
## 2 5 gini 15 0.7157715 0.2691237 0.4274002 0.002287683
## 3 6 gini 10 0.7257447 0.3018014 0.4217344 0.002788773
## 4 6 gini 15 0.7208863 0.2868017 0.4244283 0.002329176
## 5 7 gini 10 0.7299683 0.3157844 0.4194954 0.002887538
## 6 7 gini 15 0.7248047 0.2999997 0.4227247 0.002527797
## KappaSD RMSESD
## 1 0.006953160 0.0009671948
## 2 0.005814953 0.0007873598
## 3 0.007371201 0.0010379908
## 4 0.005885877 0.0008587080
## 5 0.007161393 0.0005065771
## 6 0.006734663 0.0008682306
best_mtry <- rf_model_p$bestTune$mtry
best_min_node_size <- rf_model_p$bestTune$min.node.size
# Get average (ie over the folds) RMSE and AUC ------------------------------------
CV_RMSE_folds[["rf_p"]] <- rf_model_p$resample[,c("Resample", "RMSE")]
auc <- list()
for (fold in c("Fold1", "Fold2", "Fold3", "Fold4", "Fold5")) {
  cv_fold <- rf_model_p$pred %>%
    filter(Resample == fold)
  roc_obj <- roc(cv_fold$obs, cv_fold$X1)
  auc[[fold]] <- as.numeric(roc_obj$auc)
}
CV_AUC_folds[["rf_p"]] <- data.frame("Resample" = names(auc),
                                     "AUC" = unlist(auc))
CV_RMSE[["rf_p"]] <- mean(CV_RMSE_folds[["rf_p"]]$RMSE)
CV_AUC[["rf_p"]] <- mean(CV_AUC_folds[["rf_p"]]$AUC)
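The per-fold AUC above comes from `pROC::roc()`, called with the fold's observed classes and the predicted probabilities for class `X1`. A minimal, self-contained illustration of that call (toy vectors, not the actual fold data):

```r
library(pROC)

# Toy fold: observed classes and predicted probabilities for class "X1"
obs  <- factor(c("X0", "X0", "X1", "X1"), levels = c("X0", "X1"))
p_x1 <- c(0.2, 0.4, 0.6, 0.9)

roc_obj <- roc(obs, p_x1)
as.numeric(roc_obj$auc)  # 1: these probabilities rank every X1 above every X0
```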
# Now use loss function and search for best thresholds and expected loss over folds -----
best_tresholds_cv <- list()
expected_loss_cv <- list()
for (fold in c("Fold1", "Fold2", "Fold3", "Fold4", "Fold5")) {
  cv_fold <- rf_model_p$pred %>%
    filter(mtry == best_mtry,
           min.node.size == best_min_node_size,
           Resample == fold)
  roc_obj <- roc(cv_fold$obs, cv_fold$X1)
  best_treshold <- coords(roc_obj, "best", ret = "all", transpose = FALSE,
                          best.method = "youden", best.weights = c(cost, prevelance))
  best_tresholds_cv[[fold]] <- best_treshold$threshold
  expected_loss_cv[[fold]] <- (best_treshold$fp*FP + best_treshold$fn*FN) / length(cv_fold$X1)
}
# average
best_tresholds[["rf_p"]] <- mean(unlist(best_tresholds_cv))
expected_loss[["rf_p"]] <- mean(unlist(expected_loss_cv))
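The expected-loss formula used in the loop is the cost-weighted misclassification count per observation, (fp·FP + fn·FN)/N, where fp and fn are the error counts at the chosen threshold and FP, FN are the unit costs defined earlier in the report. A quick arithmetic check with illustrative, made-up counts and costs (not the report's actual values):

```r
# Hypothetical fold: 1000 observations, illustrative costs
n  <- 1000
fp <- 120   # false positives at the chosen threshold
fn <- 60    # false negatives at the chosen threshold
FP_cost <- 1
FN_cost <- 10

(fp * FP_cost + fn * FN_cost) / n  # expected loss per observation: 0.72
```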
rf_summary <- data.frame("CV RMSE" = CV_RMSE[["rf_p"]],
                         "CV AUC" = CV_AUC[["rf_p"]],
                         "Avg of optimal thresholds" = best_tresholds[["rf_p"]],
                         "Threshold for Fold5" = best_treshold$threshold,
                         "Avg expected loss" = expected_loss[["rf_p"]],
                         "Expected loss for Fold5" = expected_loss_cv[[fold]])
kable(x = rf_summary, format = "html", booktabs = TRUE, digits = 3, row.names = TRUE,
      linesep = "", col.names = c("CV RMSE", "CV AUC",
                                  "Avg of optimal thresholds", "Threshold for fold #5",
                                  "Avg expected loss", "Expected loss for fold #5"))
| CV RMSE | CV AUC | Avg of optimal thresholds | Threshold for fold #5 | Avg expected loss | Expected loss for fold #5 |
|---|---|---|---|---|---|
| 0.419 | 0.818 | 0.550 | 0.545 | 0.985 | 0.971 |
# Create plots - this is for Fold5
createLossPlot(roc_obj, best_treshold, "rf_p_loss_plot")
createRocPlotWithOptimal(roc_obj, best_treshold, "rf_p_roc_plot")
# Take model to holdout and estimate RMSE, AUC and expected loss ------------------------------------
rf_predicted_probabilities_holdout <- predict(rf_model_p, newdata = data_holdout, type = "prob")
data_holdout$rf_p_prediction <- rf_predicted_probabilities_holdout[,"X1"]
RMSE(data_holdout$rf_p_prediction, data_holdout$fast_growth)
## [1] 0.4130405
# ROC curve on holdout
roc_obj_holdout <- roc(data_holdout$fast_growth, data_holdout[, "rf_p_prediction", drop=TRUE])
# AUC
as.numeric(roc_obj_holdout$auc)
## [1] 0.8495307
# Get expected loss on holdout with optimal threshold
holdout_treshold <- coords(roc_obj_holdout, x = best_tresholds[["rf_p"]], input = "threshold",
                           ret = "all", transpose = FALSE)
expected_loss_holdout <- (holdout_treshold$fp*FP + holdout_treshold$fn*FN) / length(data_holdout$fast_growth)
expected_loss_holdout
## [1] 0.976952
# Confusion table on holdout with optimal threshold
holdout_prediction <-
  ifelse(data_holdout$rf_p_prediction < best_tresholds[["rf_p"]], "X0", "X1") %>%
  factor(levels = c("X0", "X1"))
cm_object4 <- confusionMatrix(holdout_prediction, data_holdout$fast_growth_f)
cm4 <- cm_object4$table
cm4
## Reference
## Prediction X0 X1
## X0 12920 5538
## X1 377 1644
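In this confusion table the rows are predictions and the columns the reference, so the 377 in the (X1, X0) cell are false positives and the 5538 in the (X0, X1) cell false negatives; these are the counts behind the holdout expected loss. A small sketch of extracting them (the matrix is rebuilt here from the printed table above):

```r
# Rebuild the printed holdout confusion table (rows = prediction, cols = reference)
cm <- matrix(c(12920, 377, 5538, 1644), nrow = 2,
             dimnames = list(Prediction = c("X0", "X1"),
                             Reference  = c("X0", "X1")))

fp <- cm["X1", "X0"]  # firms flagged as fast-growth that were not (377)
fn <- cm["X0", "X1"]  # fast-growth firms the model missed (5538)
c(fp = fp, fn = fn)
```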
# Summary results ---------------------------------------------------
nvars[["rf_p"]] <- length(rfvars)
summary_results <- data.frame("Number of predictors" = unlist(nvars),
                              "CV RMSE" = unlist(CV_RMSE),
                              "CV AUC" = unlist(CV_AUC),
                              "CV threshold" = unlist(best_tresholds),
                              "CV expected Loss" = unlist(expected_loss))
model_names <- c("Logit X1", "Logit X6",
                 "Logit LASSO", "RF probability")
summary_results <- summary_results %>%
  filter(rownames(.) %in% c("X1", "X6", "LASSO", "rf_p"))
rownames(summary_results) <- model_names
kable(x = summary_results, format = "html", booktabs = TRUE, digits = 3, row.names = TRUE,
      linesep = "", col.names = c("Number of predictors", "CV RMSE", "CV AUC",
                                  "CV threshold", "CV expected Loss"))
| Model | Number of predictors | CV RMSE | CV AUC | CV threshold | CV expected Loss |
|---|---|---|---|---|---|
| Logit X1 | 18 | 0.475 | 0.554 | 0.549 | 1.051 |
| Logit X6 | 115 | 0.463 | 0.636 | 0.721 | 1.049 |
| Logit LASSO | 102 | 0.463 | 0.616 | 0.722 | 1.051 |
| RF probability | 30 | 0.419 | 0.818 | 0.550 | 0.985 |